Analysis Report for a Sample Text Dataset

1.0 Data set

data_file   <- "/Users/zzahir1978/Desktop/Sample data/en.sahih.txt"
##            Name Verse
##   1: Al-Fatihah     7
##   2: Al-Baqarah   286
##   3:  Ale Imran   200
##   4:   An-Nisa'   176
##   5: Al-Ma'idah   120
##  ---                 
## 110:    An-Nasr     3
## 111:   Al-Masad     5
## 112:  Al-Ikhlas     4
## 113:   Al-Falaq     5
## 114:     Al-Nas     6
##                  V1        V2
## 1:   Data Size (MB)      0.86
## 2:      Nos.Of Line   6249.00
## 3: Nos.Of Character 891800.00
## 4:     Nos.Of Words 158992.00
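
A summary table like the one above could be produced along the following lines. This is a minimal base-R sketch, not the report's actual code: the helper name `summarise_text_file`, the whitespace-based word count, and the convention 1 MB = 1024^2 bytes are all assumptions, so its figures may differ slightly from those reported.

```r
# Hypothetical helper: summarise a plain-text file (size, lines, characters, words).
summarise_text_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Split each line on runs of whitespace and drop empty tokens.
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  data.frame(
    metric = c("Data size (MB)", "No. of lines",
               "No. of characters", "No. of words"),
    value  = c(round(file.info(path)$size / 1024^2, 2),  # assumes 1 MB = 1024^2 bytes
               length(lines),
               sum(nchar(lines)),                        # newlines are not counted
               length(tokens))
  )
}

# Usage on a small temporary file (a stand-in for data_file).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line of text", "second line"), tmp)
summary_tbl <- summarise_text_file(tmp)
print(summary_tbl)
```

Passing `data_file` instead of `tmp` would summarise the full text file.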


2.0 Sample size in number of lines

##    data_size
## 1:    4374.3
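
The value above is consistent with taking 70% of the 6,249 lines reported in section 1.0 (0.7 × 6249 = 4374.3). The sampling fraction is inferred from those two numbers rather than stated in the report, so `sample_fraction` below should be read as an assumption. The sketch also shows one way to turn the fractional size into an actual sample of line indices.

```r
# Assumed inputs: line count from section 1.0 and an inferred 70% sampling fraction.
n_lines <- 6249
sample_fraction <- 0.7  # assumption: not stated in the report

data_size <- n_lines * sample_fraction
print(data_size)

# A fractional number of lines cannot be drawn, so round down before sampling.
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
sample_idx <- sample(seq_len(n_lines), size = floor(data_size))
length(sample_idx)
```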

3.0 Text Data Analysis Results


3.1 Most frequent and least frequent words

3.1.1 Top 10 most frequent words

##       word count
##  1:  allah  2065
##  2:   will  1664
##  3: indeed  1044
##  4:   lord   670
##  5:   said   567
##  6:    say   547
##  7: people   508
##  8:   upon   443
##  9: except   333
## 10:  among   333
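
A frequency table of this shape can be built by tokenising the text, removing stop words, and counting. The session info lists the `tm` package, which suggests the report used it, but the sketch below uses base R only so that it runs without extra packages. The helper `word_counts`, the toy input, and the one-word stop list are hypothetical.

```r
# Hypothetical helper: count word frequencies in a character vector of lines.
word_counts <- function(lines, stop_words = character(0)) {
  text <- tolower(lines)
  # Replace anything that is not a lowercase letter, apostrophe, or space with a space.
  text <- gsub("[^a-z' ]+", " ", text)
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input standing in for the sampled lines.
toy_lines <- c("Say: indeed the people will say", "The people said, indeed")
freq <- word_counts(toy_lines, stop_words = c("the"))
head(freq, 10)
```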

3.1.2 Ten least frequent words

##          word count
##  1:    losers    31
##  2: competent    31
##  3:    former    31
##  4:     thing    31
##  5:      eyes    31
##  6: criminals    32
##  7:     wrong    32
##  8:  grateful    32
##  9: mountains    32
## 10:     gives    32
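
The lowest counts shown are 31 and 32 rather than 1, which suggests that rare words were filtered out before ranking, for example by a minimum-count or sparsity threshold. The report does not show that step, so the helper `least_frequent` and its `min_count` parameter below are purely illustrative.

```r
# Hypothetical helper: the n least frequent words among those seen
# at least min_count times.
least_frequent <- function(freq, n = 10, min_count = 1) {
  kept <- freq[freq$count >= min_count, , drop = FALSE]
  # Sort ascending by count, breaking ties alphabetically.
  kept <- kept[order(kept$count, kept$word), , drop = FALSE]
  head(kept, n)
}

# Toy frequency table (made-up counts, not the report's data).
freq_toy <- data.frame(
  word  = c("alpha", "beta", "gamma", "delta", "epsilon"),
  count = c(40L, 3L, 31L, 32L, 1L),
  stringsAsFactors = FALSE
)
rare <- least_frequent(freq_toy, n = 2, min_count = 31)
print(rare)
```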

3.1.3 Plotting 10 Most Frequent Words
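
A bar chart of the counts in section 3.1.1 could be drawn as follows. The session info lists `ggplot2` and `ggthemes`, so the original plot was probably made with those; this sketch uses base graphics instead so that it needs no extra packages. The output file name is an assumption.

```r
# Counts copied from the table in section 3.1.1.
top10 <- data.frame(
  word  = c("allah", "will", "indeed", "lord", "said",
            "say", "people", "upon", "except", "among"),
  count = c(2065, 1664, 1044, 670, 567, 547, 508, 443, 333, 333)
)

# Draw to a PDF file so the sketch also runs non-interactively.
out_file <- file.path(tempdir(), "top10_words.pdf")
pdf(out_file, width = 7, height = 5)
# Reverse the order so the most frequent word appears at the top.
bar_mids <- barplot(
  rev(top10$count), names.arg = rev(top10$word),
  horiz = TRUE, las = 1,
  xlab = "Count", main = "10 most frequent words"
)
dev.off()
```

The same call works for the least frequent words by substituting the table from section 3.1.2.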

3.1.4 Plotting 10 Least Frequent Words


3.1.5 Creating Word Cloud
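
A word cloud can be generated with the `wordcloud` package listed in the session info. The sketch below is one plausible call, not the report's own: it uses only the ten counts from section 3.1.1 as input, and the seed, palette, and output file name are assumptions. It skips the plot if the package is not installed.

```r
# Counts copied from the table in section 3.1.1.
top10 <- data.frame(
  word  = c("allah", "will", "indeed", "lord", "said",
            "say", "people", "upon", "except", "among"),
  count = c(2065, 1664, 1044, 670, 567, 547, 508, 443, 333, 333)
)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # word placement is random; the seed is arbitrary
  pdf(file.path(tempdir(), "wordcloud.pdf"))
  wordcloud::wordcloud(
    words = top10$word, freq = top10$count,
    min.freq = 1,
    random.order = FALSE,  # place the most frequent words first, near the centre
    colors = RColorBrewer::brewer.pal(8, "Dark2")
  )
  dev.off()
} else {
  message("Package 'wordcloud' is not installed; skipping the plot.")
}
```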


4.0 Session info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] twitteR_1.1.9      forcats_0.5.0      stringr_1.4.0      purrr_0.3.4       
##  [5] tibble_3.0.3       tidyverse_1.3.0    tidyr_1.1.2        readr_1.3.1       
##  [9] dtplyr_1.0.1       wordcloud_2.6      RColorBrewer_1.1-2 ggthemes_4.2.0    
## [13] ggplot2_3.3.2      data.table_1.13.0  knitr_1.30         dplyr_1.0.2       
## [17] ngram_3.0.4        tm_0.7-7           NLP_0.2-0         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5       lubridate_1.7.9  assertthat_0.2.1 digest_0.6.25   
##  [5] slam_0.1-47      R6_2.4.1         cellranger_1.1.0 backports_1.1.10
##  [9] reprex_0.3.0     evaluate_0.14    httr_1.4.2       pillar_1.4.6    
## [13] rlang_0.4.7      readxl_1.3.1     rstudioapi_0.11  blob_1.2.1      
## [17] rmarkdown_2.4    labeling_0.3     bit_4.0.4        munsell_0.5.0   
## [21] broom_0.7.1      compiler_4.0.2   modelr_0.1.8     xfun_0.18       
## [25] pkgconfig_2.0.3  htmltools_0.5.0  tidyselect_1.1.0 fansi_0.4.1     
## [29] crayon_1.3.4     dbplyr_1.4.4     withr_2.3.0      grid_4.0.2      
## [33] jsonlite_1.7.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
## [37] magrittr_1.5     scales_1.1.1     cli_2.0.2        stringi_1.5.3   
## [41] farver_2.0.3     fs_1.5.0         xml2_1.3.2       ellipsis_0.3.1  
## [45] generics_0.0.2   vctrs_0.3.4      rjson_0.2.20     tools_4.0.2     
## [49] bit64_4.0.5      glue_1.4.2       hms_0.5.3        parallel_4.0.2  
## [53] yaml_2.2.1       colorspace_1.4-1 rvest_0.3.6      haven_2.3.1